Let’s talk about data engineers’ nightmare
As data engineers, we encounter unique challenges every day. But if there is one daunting task that stands out, it must be the backfill. A flawed backfill means excessive processing time, data contamination, and substantial cloud bills. And yeah, it also means you need one more backfill job to fix it.
Completing your first successful data backfill is a data engineering rite of passage. — Dagster
A backfill demands a combination of data engineering skills to pull off well: domain knowledge to validate the results, tooling expertise to run the backfill jobs, and a solid understanding of the database to optimize the process. When all of these elements are intertwined within a single task, things can go wrong.
In this article, we will explore what data backfilling is, why it is necessary, and how to implement it efficiently. Whether you are new to backfilling or someone who still panics at the thought of it, this article will calm your mind and help you regain your confidence.
What is backfill?
Backfill is the process of filling in missing historical data for a table that didn't exist before, or replacing old data with corrected records. It is usually not a recurring job, and it is only necessary for data pipelines that update a table incrementally.
For example, consider a table partitioned on a date column. A regular daily job updates only the latest two partitions, while a backfill job can update partitions all the way back to the table's very first one. If the regular job rewrote the entire table on every run, a backfill job would be unnecessary, since the historical data would naturally be refreshed by the regular job anyway.
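To make the contrast concrete, here is a minimal sketch in Python. The table, column names, and the build_job_sql() helper are illustrative assumptions, not something from a specific pipeline; the point is only that the regular job touches a couple of recent partitions while the backfill loops over the full historical range.

```python
from datetime import date, timedelta

def build_job_sql(partition_day: date) -> str:
    # Hypothetical query template: recompute one date partition of a daily
    # aggregate table from a raw events table. Names are illustrative only.
    return f"""
    DELETE FROM analytics.events_daily WHERE event_date = DATE '{partition_day}';
    INSERT INTO analytics.events_daily
    SELECT event_date, user_id, COUNT(*) AS event_count
    FROM raw.events
    WHERE event_date = DATE '{partition_day}'
    GROUP BY event_date, user_id;
    """

def regular_daily_run(today: date, lookback_days: int = 2) -> list[str]:
    # The scheduled job only touches the latest few partitions.
    return [build_job_sql(today - timedelta(days=d)) for d in range(lookback_days)]

def backfill_run(start: date, end: date) -> list[str]:
    # A backfill walks every partition in the historical range, one day at a time.
    days = (end - start).days + 1
    return [build_job_sql(start + timedelta(days=d)) for d in range(days)]

if __name__ == "__main__":
    print(len(regular_daily_run(date(2024, 6, 1))))               # 2 partitions
    print(len(backfill_run(date(2023, 1, 1), date(2024, 6, 1))))  # every partition since day one
```

In practice you would hand each generated statement to your warehouse client or orchestrator; schedulers such as Airflow and Dagster also ship built-in backfill mechanisms that drive this kind of partition-by-partition loop for you.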
So, when do we need to backfill?
In general, there are a few common scenarios. Let’s see if you find them familiar.
Create a new table and want to fill in missing historical data